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Abstract. Using Monte Carlo methods, we simulated the effects of bias 
in generation and elimination of paralogs on the size distribution of par- 
alog groups. It was found that the function describing the decay of the 
number of paralog groups with their size depends on the ratio between 
the probability of duplications of genes and their deletions, which cor- 
responds to different selection pressures on the genome size. Slightly 
different slopes of curves describing the decay of the number of paralog 
groups with their size were also observed when the threshold of homology 
between paralogous sequences was changed. 



1 Introduction 

It is widely accepted that evolution is driven by two random processes - muta- 
tions and recombinations and a directional process - selection. Recombination 
not only re-shuffles genes inside genomes or between genomes but it is also re- 
sponsible for amplification or elimination of sequences. Duplication of complete 
coding sequences produces additional copies of genes called paralogs. Thus, par- 
alogous genes are homologous sequences arisen through gene duplication and 
parallel evolution in one genome. Paralogs can also appear by duplication of 
large fragments of chromosomes or even by fusion of different genomes (allopoly- 
ploidization) . Before the fusion, corresponding sequences in the two genomes 
which had a common ancestor in the past are called orthologs [1] . Since it would 
be very difficult to reproduce their real history, when they appear in the genome 
of one organism they are recognized as paralogs. Paralogs are a source of sim- 
ple redundancy of information, making the genome more stable and resistant to 
mutational effect by complementing the function of one copy when the other is 
damaged by mutation [2] or by reinforcing the function of the amplified gene. 
Most importantly, gene duplication generates a sequence with a defined func- 
tion but released from the selection pressure. Redefinition of the duplicated gene 
function may ameliorate the biological potential of the individual. Taking under 
consideration all the profits brought by paralogs one can ask why the number 
of paralogs seems to be limited. First of all, a higher number of gene copies, 
frequently causing a higher level of products does not mean a more concerted 
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expression of the gene function. The best example - the Down syndrome - is 
caused by redundant information. Second, limitation comes also from the cost 
of replication and translation of information, which leads to selection pressure 
on the genome size. The genome size is the result of compromise between the 
trends for accumulating information and keeping the costs of replication in the 
reasonable limit. Nevertheless, the level of redundancy in genetic information 
is high, for example in a uni-cellular eukaryote organism - Saccharomyces cere- 
visiae (baker's yeast) - probably no more than 20 % of genes fulfill essential 
functions and stay in unique copies. The function of the rest of genes can be 
complemented, probably mostly by paralogous sequences [3] , [4] . 

According to the definition, all the genes in the genome which have a common 
ancestor belong to one paralog family or group. However, the genome analysis 
does not give us direct information about the descent of sequences from the 
common ancestor because we can only conclude about the common progenitor 
on the basis of homology between compared currently "living" sequences. The 
level of homology could additionally indicate the time when the two sequences 
have diverged. Approximately, the number of mutations which have occurred in 
the diverging sequences grows with time linearly, though it may depend on the 
topological character of the duplication itself (i.e. duplication with or without 
inversion) [7] . Furthermore, the fraction of positions in which the two sequences 
differ docs not grow linearly because of multiple substitutions (substitutions 
which have occurred in the same position several times) and reversions whose 
probability grows in time. Thus, the level of homology is not an exact measure 
of divergence time (branching time) . At large time distances the homology be- 
tween two paralogs could be too low to recognise properly whether the observed 
homology is accidental or the compared sequences actually descend from one 
progenitor sequence. That is why a threshold of homology is assumed - if the 
homology level is below the threshold, the compared sequences are considered as 
independently evolved. Since the threshold is arbitrary, and differs in different 
analyses, it is important to find whether the size distribution of paralog families 
depends on the cutoff level of homology. 

In all analyzed genomes the distribution of paralog families follows a specific 
rule. Some authors claim an exponential function [5], others a power law rul- 
ing the frequency of the occurrence of the folds or protein families [6], [8], [9]. 
The latter authors assumed a limited number of the initial sequences evolving 
into the full genome of the contemporary organism. In our simulations we have 
assumed that the evolution of the contemporary genomes has started with all 
the genes indispensable for survival of the individuals and these initial genes 
were independent progenitors of all paralog families. The organisation of these 
genes in higher hierarchy (families or folds) was neglected. We have analysed 
how the size distribution of paralog families depends on the selection pressure, 
on genome size and on the arbitrarily accepted threshold of homology deciding 
about the grouping of the sequences into paralog families. The selection pressure 
is an objective force influencing the genome evolution while the paralog identi- 
fication errors are connected with our ignorance, rather. In our simulations we 
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used two different ways for measuring the distance between paralogs: the first 
one was somewhat absolute because it measured the real time of duplication and 
the second one corresponded to the homology analysis - the Hamming distance 
between two sequences (bit-strings) was measured. 



2 Experimental Distributions 



Analysis of the first completely sequenced genome demonstrated that distribu- 
tions of sizes for paralog families indicate a high level of gene duplication [10]. 
Initial comparison, of bacterial, archeal and eukaryotic genomes has shown that 
the number of sequences in protein families vs. corresponding family sizes dis- 
plays power law distributions [8,11]. 

In contrast, Slonimski et al. [5] in an one page note, reported that for protein 
families of N = 2 to 5 — 6 members, the clusters of N + 1 contain half the 
number of proteins observed in clusters of N, independently of the microbial 
genome size. Their methodology [12], [13] used Smith- Waterman scores SW > 
22, the Z-significance values, and connective-clusters in which a given sequence 
had similarity of Z va i ue > 8 with one or more other sequences. The analysis have 
been performed on yeast and 4 microbial genomes. 

Yanai et al. [9] have compared paralog distributions for 20 genomes, us- 
ing BLAST and E-significance values ranging from E = 10~ 10 to as large as 
E = 10~ 3 . They report linear fits on log- log scale for all genomes, with some- 
what noisy behaviour for larger families. Qian et al. [14] have linked the power 
law distribution of gene families in genomes, with the distribution of structural 
motifs and protein folds, all three displaying identical slope on log-log plots. Their 
analyses involved again 20 microbial genomes, and also inter-genome compar- 
isons within analogous functional and structural families. 

linger et al. [15] compared orthologous gene distributions in three large cu- 
rated databases: COG, ProtoMap, and Prcdom (28031, 81286 and 278584 se- 
quences respectively), and also performed partial analysis of a human genome. 
They again observed a power law behaviour relating the number of sequences 
in structural and functional families F(N) of a given size N, by F(N) oc N~ b , 
where b - the slope of linear fits on log-log plots. Additionally they have linked 
the slopes for small families, and those for large families by 6 50 = 1 + I/6500, 
where 650 and 6500 stand for the 50 smallest, and the 500 largest families, after 
ranking them by size. 

Nimwegen [16] has observed power laws, comparing the number of genes in 
functional categories vs. total number of genes in a genome, with exponents 
varying both between bacterial, archeal and eukaryotic genomes, and especially 
between functional categories: from 0.13 for the protein synthesis in bacteria, to 
as high as 3.36 for the defense response in eukaryotes. 
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Fig. 1. Part a: Slopes of the log- log fittings as a function of the Z a cut-off values. 
Borrelia burgdorferi - pentagrams (850 genes); Haemophilus influenzae - crosses 
(1712 genes); Metanococcus jannaschii - squares (1721 genes), Sulfolobus solfatar- 
icus - triangles (2939 genes), Arabidopsis thaliana - circles (26462 genes). Part 
b: Dependence of the power law exponents on genome size for three Mycoplasma 
bacteria: M.genitalium (486 genes) - squares, M. pneumoniae (687 genes) - tri- 
angles, and M. pulmonis (778 genes) - circles 



3 Current Work 



The Z va i ue [12, 13] data of all intragenomic pairwise alignments for 61 complete 
genomes [18] have been used. In no case an exponential decay for a distribution of 
paralogous family sizes was found, independently of the cut-off threshold of the 
Zvaiue used as a similarity measure. As the Z va i ue depends much on the length of 
compared sequences [12, 13], here we use an amended similarity measure between 
sequences A and B, Z a = Z va i ue (A, B)/ ma,x[Z va i ue (A), Z value (B)}. For identical 
sequences Z a = 1, and it tends to zero with increasing dissimilarity. 

Figure la presents the slopes of the log- log fittings as a function of the Z a cut- 
off values between 0.01 and 0.6 used, for several genomes. For Z a < 0.04 — 0.05, 
for all genomes there are only one or two huge super-clusters, and small frac- 
tions of singletons and doublets (sometimes also triplets). Clearly such a small 
cut-off is too low to distinguish anything of interest. For high values of Z a , but 
obviously depending much on the genome size, most sequences are similar only 
to themselves, and there are mostly singletons, with few still remaining dou- 
blets/triplets. At the less stringent similarity cut-off there are regions of gradual 
change, interspersed by sharp changes in behaviour - corresponding obviously 
to the splitting events, when clusters are broken, and a possible relationship 
between homology and function within family/cluster is disrupted. Somewhere 
in between these two extremes there is a small region of usefulness, when the 
slope of the log-log fits seems to depend more or less linearly on the cut-off Z a 
value. Tentatively it might be attributed to a Z a range of 0.04 - 0.1, as for most 
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log 10 (family size N) 

Fig. 2. Comparison of the log-log plots for Haemophilus influenzae between 
the data from Brenner et al. [10] - circles; the current work: Z a = 0.04 - stars, 
Z a = 0.5 - squares, and Z a = 0.06 - triangles; and when using [12], [13] Z va i ue > 8 
- pentagrams. The steeper solid line, for fitting all but the last point of Brenner's 
data, has the slope -2, the more shallow one the slope of -1.5 uses all points. 
All three methods are based on the use of significance for the Smith- Waterman 
local alignments. 



genomes analysed, we can see a relative plateau of the log-log slope changes with 
increasing Z a . 

Moreover, as can be seen in Fig. la, any comparisons between genomes must 
depend to a high degree on the cut-off value of the similarity measure actually 
used. For example, the data of Brenner et al. [10] for Haemophilus influenzae 
would suggest the slope of the log-log plot equal -1.50, which would imply, if 
compared to the Fig. la, the Z a in between of 0.02 and 0.04, clearly in a twilight 
zone before the supposedly useful region of linear dependence of the slope on Z a . 
However, the last point (Fig. 2, circles) changes the slope of the fit significantly, 
the slope after its exclusion equals -1.98. The corresponding analysis using Z a 
reveals (Fig. 2, stars Z a — 0.04, squares Z a — 0.05, triangles Z a = 0.06) that 
best agreement between ref. 10 and the current work is at Z a = 0.05, and that in 
both cases power law approximation underestimates big-sized families (rightmost 
points, Fig. 2), especially at higher Z a . Finally, the results of cut-off Z va i ue = 8, 
used by Slonimski et al. [5], [12] (Fig. 2, pentagrams), again agree with both 
Brenner's and current results. 

The often emphassized dependence of the fitted log-log slopes on the genome 
size can be observed only as a general trend, with many exceptions. Metanococ- 
cus janaschii and Haemophilus influenzae are of almost identical size of about 
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1700 sequences, but their behaviour is strikingly different, with H. influenzae dis- 
playing the quickest change of slopes with increasing Z a of all genomes analysed. 
Also, H. influenzae large clusters are breaking down to singletons much faster 
(e.g. the rightmost crosses of Haemophilus in Fig. 2, correspond to the bipartite 
composition of vast majority of singletons, and a very small remainder of what 
was before one or two big families). Sulfolobus solfataricus - at approximately 
one tenth the genome size of Arabidopsis thaliana - shows the most shallow de- 
pendence of slopes on Z a of all genomes under study, comparable to that of 
Arabidopsis. Even for the smallest genomes (Fig. lb) of Mycoplasma genitalium 
(486 genes), Mycoplasma pneumoniae (687 genes), and Mycoplasma pulmonis 
(778 genes), Fig. lb (squares, triangles, and circles respectively), which because 
of their taxonomical proximity can be compared directly relatively easy, the size 
dependence of the power law exponent is rather perturbed. 




Fig. 3. Line shows the number nk of families with k paralogs each, independent 
of the bit-string status. The symbols give, for x = 0,1,2,4,,,, from bottom to top, 
the normalized number of paralog pairs within such families of size k. p mu t = 
0.01, b = 0.1 (upper data) and 0.2 (lower data). 



4 Simulations 

The results of earlier modeling efforts can be found in Refs. 2,9,11,14-17,19. In 
our simulations we return to the problem emphasized in the Introduction, the 
number of paralogs for one given function or gene. Thus, in contrast to what 
was described in preceding sections, we assume to know for every part of the 
genome its function. In a simulation that is easy, since we can follow the whole 
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Fig. 4. Average branching time, defined as the number of iterations since the 
last creation of the paralog, versus Hamming distance, from 64,000 samples of 
10,000 ancestors each, with 200 iterations. The fluctuations in this time are 
about as large as the average. Plus signs: p mu t — 0.01, b = 0.1; x crosses: 
Pmut = 0.01, b = 0.2; line: p mut = 0.5, b = 0.1. 



evolution since the beginning; for real genomes, such knowledge in general still 
lies in the future. Our model is a simplification of our earlier one [2], which was 
shown [19] to give reasonable ageing results. 

The simulations start with N bit-strings of length L each, which are zero 
everywhere. Then at each iteration each bit-string with mutation probability 
Pmut selects randomly one of its bits and flips it. Before that, also at every 
time step, for each family (offspring of one ancestor) either the last bit-string is 
deleted (with probability 1/2 + 6) or a randomly selected bit-string is duplicated 
(with probability 1/2 — 6) and then becomes the last; the positive bias 6 keeps 
the number k of copies ("paralogs") for each of the N original bit-strings limited. 
Also, the number k is not allowed to become negative. Thus at any time we have 
for each of the A" ancestors a family consisting of the first bit-string and possibly 
k — 1 additional copies or paralogs, amounting to k bit-strings in total for one 
ancestor (=gene = function). 

The Hamming distance (= number of bits different in a bit-by-bit comparison 
of two bit-strings) was calculated for each paralog with all other bit-strings in 
the same family at the same time, giving (k — l)k/2 Hamming distances. 

The simulations mostly used L = 64, N — 10000, b = 0.1, p mut = 0.01 
for t = 200 iterations and averaged over 64000 samples. Simulations for L = 
8, 16, 32 barely differed in the results when a comparison was possible. The 
average number k of paralogs was nearly 3, i.e. we had nearly two additional 
bit-strings (plus the first one) for each ancestor. Semilogarithmic plots, Fig. 3, 
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of the number of paralog pairs for one ancestor with Hamming distance not 
exceeding x different bits typically gave straight lines with slopes only slightly 
depending on x. x was taken as 0,1,2,4,8,16,32, and 64. For large x the curves 
nearly overlap. For clarity we divided for our figure the number of pairs by the 
normalizing factor [k(k — l)/2] and thus for L = 64 get the total number of 
families. 

The overall distributions rife, lines in Fig. 3, decay exponentially, proportional 
to [(0.5 — 6)/(0.5 + b)] k = 1/1. 5 fc , in the stationary state achieved after dozens 
of iterations for rife, even when the Hamming distances still grow. This formula 
follows from a detailed balance condition that as many families move on average 
from size k to size fc+1 as move in the opposite direction from size fc + 1 to size k. 
Thus the fraction of families with only a single bit-string is 1 — (0.5 — 6)/ (0.5+6) = 
46/(1 + 26) in this geometric series. 

We define the creation of a new paralog as a branching event and store this 
time. At the end of the simulation we determine for each pair within each family 
the last event they branched away from each other; the time between this last 
event and the last iteration of the simulation is the branching time. Within each 
family the branching times fluctuate strongly but their average value for one 
given Hamming distance increases roughly linearly with that Hamming distance, 
until for large Hamming distances the statistics becomes poor, Fig. 4. For longer 
times (500 and 1000 iterations) the linearity improves. 

The above model follows ref.2 except that no selection of the fittest and 
similar complications are included now. Each of the ancestors is interpreted as 
one function (or gene) in the whole organism. The bit-string for this ancestor then 
records important mutations at different places within this gene. The paralogs 
formed in the simulation from this ancestor all refer to this one function. The first 
bit-string undergoes mutations just as its paralogs and has the same properties 
except that it can never be removed. It makes no sense to compare bit-strings for 
different functions; 00101001 means something entirely different for the function 
"brain" than for the function "hair" . The L bits of each bit-string correspond 
to 2 L possible alleles for one function, not to L base pairs. 

The N initial ancestors can also be interpreted as N different samples simu- 
lated for the same function; more generally, they could be M different genomes 
simulated for a genome of N/M functions. 

5 Summary 

We presented here two different sets of plots: In the experimental section we 
found power-law decay for the number of paralogs found by looking through 
the whole genome. In the simulation section we found exponential decays for 
the number of paralogs belonging to one known function. The latter exponential 
decay agrees nicely with simple arguments based on detailed balance; the slopes 
in these semilogarithmic plots (Fig. 3) are determined by our bias in favour of 
removal instead of addition of a paralog, and the slopes barely depend on the 
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cut-off parameter x for the Hamming distance. This agreement of theory with 
simulation also makes clear that our results would be quite different if the bias 
would not be the same for all functions. 
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